We ensure that the dataset, sourced from the Badminton World Federation (BWF) website and curated by Andrew Zhuang, is loaded into memory in a clean, consistent format. This prepares it for downstream analysis including feature engineering, modelling, and visualization.
The raw dataset is a CSV file containing match-level records from 2018 to 2025. It includes fields such as event type, players, match outcomes, and tournament metadata. Since scraped data can be messy, we apply a series of checks and transformations immediately on import to ensure quality.
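The import-time checks can be sketched as follows. This is a minimal illustration, not the project's actual cleaning code; the column names (`year`, `winner`, `loser`) and the specific checks are assumptions chosen to match the description above.

```python
import io
import pandas as pd

def clean_matches(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative import-time checks for scraped match records."""
    df = df.drop_duplicates()                       # scraping can duplicate rows
    df = df.dropna(subset=["winner", "loser"])      # a match needs both players
    df["year"] = pd.to_numeric(df["year"], errors="coerce").astype("Int64")
    df = df[df["year"].between(2018, 2025)]         # keep the documented range
    return df.reset_index(drop=True)

# Tiny inline sample standing in for the scraped CSV.
raw = pd.read_csv(io.StringIO(
    "year,winner,loser\n"
    "2018,A,B\n"
    "2030,C,D\n"      # out-of-range year
    "2019,E,\n"       # missing loser
    "2019,A,B\n"
    "2019,A,B\n"      # duplicate row
))
clean = clean_matches(raw)
```

Running this on the sample leaves only the two valid rows, showing how malformed records are dropped at the door rather than surfacing later in modelling.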
3.2 Load the Data
We begin by defining paths and reading the CSV into a pandas DataFrame. Restricting to singles events (MS for Men’s Singles, WS for Women’s Singles) avoids mixing in doubles matches, which require different modelling due to partner synergy.
Required libraries are loaded first. The trained model is written to the S3 bucket badminton12345, where the training dataset is also stored in CSV format.
Code
```python
# train_model.py
import os, json, inspect
from pathlib import Path
from collections import defaultdict

import numpy as np
import pandas as pd
import networkx as nx
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import StackingClassifier
from sklearn.metrics import roc_auc_score, accuracy_score, brier_score_loss
from xgboost import XGBClassifier
import joblib

# ---------- config ----------
BASE_DIR = Path("/Users/yifanw124/STAT468/stat468-final-project")
DATA_PATH = BASE_DIR / "tournaments_2018_2025_June.csv"
OUT_DIR = BASE_DIR
OUT_MODEL = OUT_DIR / "stack_model.joblib"
OUT_META = OUT_DIR / "feature_spec.json"
PIN_TO_S3 = os.getenv("PIN_TO_S3", "false").lower() == "true"
USE_VETIVER_BUNDLE = os.getenv("USE_VETIVER", "false").lower() == "true"
RANDOM_STATE = 42
MODEL_BUCKET = os.getenv("MODEL_BUCKET", "")       # used only if PIN_TO_S3
MODEL_PIN = os.getenv("MODEL_PIN", "stack_model")  # also used as vetiver model_name
```
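The singles restriction described above can be sketched as below. In the real script the frame would come from `pd.read_csv(DATA_PATH)`; here a tiny inline sample stands in, and the `event` column name is an assumption about the scraped schema.

```python
import pandas as pd

# Stand-in for: df = pd.read_csv(DATA_PATH)
df = pd.DataFrame({
    "event":  ["MS", "WD", "WS", "XD", "MS"],   # "event" column name is assumed
    "winner": ["P1", "Pair1", "P3", "Pair2", "P5"],
    "loser":  ["P2", "Pair3", "P4", "Pair4", "P6"],
})

# Keep Men's Singles and Women's Singles; doubles disciplines are excluded
# because partner synergy would require a different model.
singles = df[df["event"].isin(["MS", "WS"])].reset_index(drop=True)
```

After the filter, only MS and WS rows remain, so every record describes exactly one player per side.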